Fast Runtime Block Cyclic Data Redistribution on Multiprocessors

Authors

  • Loïc Prylli
  • Bernard Tourancheau
Abstract

Block cyclic distribution suits most linear algebra algorithms well, and this type of data distribution was chosen for the ScaLAPACK library as well as for the HPF language. However, the block size must be chosen as a compromise between computation efficiency, communication efficiency, and load balancing. The best choice depends heavily on each operation, so it is essential to be able to go from one block cyclic distribution to another very quickly. It is equally essential to be able to choose the right number of processors and the best grid shape for a given operation. We present here the data redistribution algorithms we implemented in the ScaLAPACK library to go from one block cyclic distribution on a grid to another one on another grid. A complexity study shows the efficiency of our solution. Timing results on the Intel Paragon and the Cray T3D corroborate our results.
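As an illustration of what a redistribution between two block cyclic layouts involves, the following is a minimal one-dimensional sketch. It is not the algorithm from the paper: it enumerates every element naively, whereas the paper is concerned with doing this quickly. The function names (`owner`, `local_index`, `redistribution_plan`) and the example sizes are hypothetical. For a two-dimensional processor grid, the same mapping would be applied independently to the row and column indices.

```python
from collections import defaultdict


def owner(i, block, nprocs):
    """Process that owns global index i under a 1-D block cyclic layout."""
    return (i // block) % nprocs


def local_index(i, block, nprocs):
    """Position of global index i inside its owner's local array."""
    return (i // (block * nprocs)) * block + i % block


def redistribution_plan(n, src_block, src_procs, dst_block, dst_procs):
    """Map (sender, receiver) to a list of (src_local, dst_local) pairs.

    Naive element-by-element enumeration, for illustration only.
    """
    plan = defaultdict(list)
    for i in range(n):
        sender = owner(i, src_block, src_procs)
        receiver = owner(i, dst_block, dst_procs)
        plan[(sender, receiver)].append(
            (local_index(i, src_block, src_procs),
             local_index(i, dst_block, dst_procs))
        )
    return dict(plan)


# Hypothetical example: 12 elements, blocks of 2 on 2 processes,
# redistributed to blocks of 3 on 3 processes.
plan = redistribution_plan(n=12, src_block=2, src_procs=2,
                           dst_block=3, dst_procs=3)
for (sender, receiver), pairs in sorted(plan.items()):
    print(f"{sender} -> {receiver}: {len(pairs)} element(s)")
```

Each key of `plan` corresponds to one message between a sender and a receiver, so the size of the dictionary and the lengths of its lists give a rough picture of the communication volume that a change of block size or process count implies.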

Similar resources

Runtime Array Redistribution in HPF Programs

This paper describes efficient algorithms for run-time array redistribution in HPF programs. We consider block(m) to cyclic, cyclic to block(m) and the general cyclic(x) to cyclic(y) type redistributions. We initially describe algorithms for one-dimensional arrays and then extend the methodology to multidimensional arrays. The algorithms are practical enough to be easily implemented in the runti...

Runtime Array Redistribution in HPF

This paper describes efficient algorithms for run-time array redistribution in HPF programs. We consider block(m) to cyclic, cyclic to block(m) and the general cyclic(x) to cyclic(y) type redistributions. We initially describe algorithms for one-dimensional arrays and then extend the methodology to multidimensional arrays. The algorithms are practical enough to be easily implemented in the runti...

A Generalized Processor Mapping Technique for Array Redistribution

In many scientific applications, array redistribution is usually required to enhance data locality and reduce remote memory access in parallel programs on distributed memory multicomputers. Since the redistribution is performed at runtime, there is a performance trade-off between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistribu...

An Optimal Processor Replacement Scheme for Efficient Communication of Runtime Data Redistribution

Dynamic data distribution is used to enhance data locality and algorithm performance by reducing inter-processor communication in data parallel programs on distributed memory multi-computers. Since the exchange of data is performed at run-time, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of ex...

Efficient FFT mapping on GPU for radar processing application: modeling and implementation

General-purpose multiprocessors (as, in our case, Intel IvyBridge and Intel Haswell) increasingly add GPU computing power to the former multicore architectures. When used for embedded applications (for us, Synthetic aperture radar) with intensive signal processing requirements, they must constantly compute convolution algorithms, such as the famous Fast Fourier Transform. Due to its "fractal" n...

Journal:
  • J. Parallel Distrib. Comput.

Volume 45

Publication date: 1997